Ensure that comballoc.ml does not reverse the order of allocations #1917

stedolan · 2018-07-19T11:33:48Z

Since their fields are provided at creation time, immutable blocks point only to older blocks. The current GC does not rely on this invariant, but multicore does.

Since the minor heap is filled from higher addresses to lower (allocation is subtraction from the heap pointer), this means that immutable blocks will point to higher addresses. Specifically, multicore relies on the invariant that all fields in the minor heap containing pointers:

point to the major heap, or
point to a higher address in the minor heap, or
were created by caml_modify

This invariant is broken by comballoc, which combines multiple allocations into one, since the current implementation of comballoc actually reverses the allocation order, which caused a nasty bug in multicore (ocaml-multicore/ocaml-multicore#131).

This patch makes comballoc preserve the order of allocation. Here's the diff in the -dlinear output of fun x -> 1 :: 2 :: 3 :: x:

 camlComb__f_1199: {comb.ml:1,6-26}
   x/29[%rbx] := R/0[%rax]
   {x/29[%rbx]*}
   V/30[%rax] := alloc 72 {comb.ml:1,20-26}
+  V/30[%rax] := V/30[%rax] + 48 {comb.ml:1,20-26}
   [V/30[%rax] + -8] := 2048 (init)
   [V/30[%rax]] := 7 (init)
   val[V/30[%rax] + 8] := x/29[%rbx] (init)
-  V/31[%rbx] := V/30[%rax] + 24
+  V/31[%rbx] := V/30[%rax] + -24 {comb.ml:1,15-26}
   [V/31[%rbx] + -8] := 2048 (init)
   [V/31[%rbx]] := 5 (init)
   val[V/31[%rbx] + 8] := V/30[%rax] (init)
-  V/32[%rax] := V/30[%rax] + 48
+  V/32[%rax] := V/31[%rbx] + -24 {comb.ml:1,10-26}
   [V/32[%rax] + -8] := 2048 (init)
   [V/32[%rax]] := 3 (init)
   val[V/32[%rax] + 8] := V/31[%rbx] (init)

The other difference visible above is that the new version uses instr_cons_debug instead of instr_cons, so preserves more location information.

gasche · 2018-07-20T08:01:53Z

As someone who doesn't know much about the backend, I am a bit perplexed by your description.

I don't understand how the -dlinear output, before the patch, would create a pointer into a younger object. It seems that the pointers are created in the instructions of the form val[V/(N+1) + 8] := V/N, and they write the address of the already-initialized (older) N-th block into the cdr field of the (N+1)-th block.

The main difference that I see is that you fill the blocks from the right to the left (where "right" is the larger adresses), instead of filling from the left to the right. I wondered if that directed was related to creation time, but with a linear model of the minor heap, I would rather assume that non-combined allocations are created in increasing address order, giving a layout closer to the previous version that the newly proposed one.

stedolan · 2018-07-20T09:53:57Z

This was confusingly worded! I've updated the description, and hopefully it makes more sense now.

The main difference that I see is that you fill the blocks from the right to the left (where "right" is the larger adresses), instead of filling from the left to the right.

Yes, this is the point of the patch. The desired invariant is roughly "immutable pointers point right".

I wondered if that directed was related to creation time, but with a linear model of the minor heap, I would rather assume that non-combined allocations are created in increasing address order, giving a layout closer to the previous version that the newly proposed one.

This is actually false. On the minor heap, allocations are created in decreasing address order, and allocation is implemented as a subtraction.

gasche · 2018-07-20T11:43:30Z

That clarifies things, thanks!

Now that I understand the code better, I have minor questions on the differences in code generation.

The most obvious difference is that you have one additional arithmetic operation at the first combined allocation, to go the end of the block. I expect the performance cost to be negligible. Regarding code compacity, your are careful to check (if newsz = sz) when there is in fact only one block in total, and not emit there, so it only occurs on multi-block situations, so it should amortize. I'm not worried about that change.

On the other hand, there is a subtler difference in the code generation, which in some sense corresponds to the change from offsets to sizes, in the case where allocstate is a Pending_alloc. With the old code, we would call ourselves recursively with the same reg register that we got in argument; with the new code, you call yourself with i.res.(0) (as in the allocation case). This corresponds to a difference between computing all sub-block addresses by offset from a common base address, or iteratively adding to one sub-address to get the next one.

I wonder if this may cause a performance difference. If I understand correctly, modern processors perform reasoning on data dependencies to run code in a pipelined and parallel way, your new approach introduces much shorter dependencies (each address on the previous one) than the old.

I'm not sure the switch from offsets to sizes is that important to fill the blocks in the opposite order. Couldn't we have kept exactly the same code generation structure as before, but changed the sub-block address computation from reg + ofs to reg + (newsz - ofs), and gotten the same properties? (Maybe that prevents your nice factorization of the "first allocation" and the "above young_wosize" cases?)

stedolan · 2018-09-18T15:48:44Z

I'm not sure the switch from offsets to sizes is that important to fill the blocks in the opposite order. Couldn't we have kept exactly the same code generation structure as before, but changed the sub-block address computation from reg + ofs to reg + (newsz - ofs), and gotten the same properties? (Maybe that prevents your nice factorization of the "first allocation" and the "above young_wosize" cases?)

Honestly, this was just the first way that came to mind. Your approach sounds simpler, I'll try it and see how it looks.

I wonder if this may cause a performance difference. If I understand correctly, modern processors perform reasoning on data dependencies to run code in a pipelined and parallel way, your new approach introduces much shorter dependencies (each address on the previous one) than the old.

You are correct, but the difference will not be noticeable here. On modern x86s, the results of a register/register addition are available on the very next cycle. Due to redundant ALUs, 2-4 independent register/register additions can be dispatched on the same cycle, so there is in principle a performance benefit from avoiding the dependency. However, the x86 code produced by ocamlopt is not tuned to the point where we'd notice the effect of shaving off a fraction of a cycle. (I'm less knowledgeable about other architectures, but generally speaking the benefits of avoiding dependency chains only really show up with operations more time-consuming than a register-register add. BTW, for x86 details, the best references is https://www.agner.org/optimize/)

mshinwell · 2018-09-26T10:17:07Z

Having spent quite a lot of my working life doing code review, I'd like to make a general observation about the process, which I have mentioned to some people in the past. I think it is relevant to the discussion above and also to other pull requests where I have seen a similar pattern.

I believe that the functions of code review should be to establish correctness and maintainability of the code. I think it's a mistake, albeit an easy one to make, to let code review slip into "but maybe it could have been done this way instead". Everyone writes things differently and, so long as those two criteria (correctness and maintainability) are satisfied, I think it is rarely a good use of time to go down this path however well-meaning the intentions of the reviewer. It causes delays, takes time away from other projects and I wouldn't be surprised if it also runs the risk of net introducing more bugs.

In this particular case, this patch (which I think is a blocker for multicore) has been around since mid-July, and from @gasche 's review and @stedolan 's comments, appears to be satisfactory without being rewritten, tested and thought about again. Can we just merge it?

gasche · 2018-09-26T12:37:54Z

I would be willing to follow what @stedolan prefers here (he had and still has the option to say "you know what, let's go with the PR as stated").

I haven't approved the current patch because this code is tricky logic and it takes a non-small effort to convince oneself that it corresponds to the previous one, I haven't done the work yet. Mark, if you are confident that the change is correct, feel free to approve it.

mshinwell · 2018-09-26T12:39:35Z

I will look over it as well then.

mshinwell

This looks correct to me. OK for merging after a couple of small things addressed, I don't need to review again.

mshinwell · 2018-09-27T15:19:41Z

asmcomp/comballoc.ml

-              i.arg i.res i.dbg newnext, ofs)
-          end
+      | _ ->
+         let (newnext, newsz) = combine i.next (Pending_alloc(i.res.(0), sz)) in


Maybe rename newsz -> totalsz, to match the comment etc.?

mshinwell · 2018-09-27T15:20:01Z

asmcomp/comballoc.ml

-                label_after_call_gc = None; }))
-              i.arg i.res i.dbg newnext, ofs)
-          end
+      | _ ->


This match should be exhaustive, | No_alloc | Pending_alloc _ ->.

mshinwell · 2018-09-27T15:24:10Z

asmcomp/comballoc.ml

-              combine i.next (Pending_alloc(reg, ofs + sz)) in
-            (instr_cons (Iop(Iintop_imm(Iadd, ofs))) [| reg |] i.res newnext,
+      | Pending_alloc(reg, totalsz)
+          when totalsz + sz < Config.max_young_wosize * Arch.size_addr ->


Q: Why is the multiplication by Arch.size_addr here? (Yes, I know it was there before...)

A: It looks like I messed up when adding the inline record field called words. It should be called bytes instead. I will make a pull request to fix this. (This shows the value of inline record fields rather than anonymous constructor arguments, especially when units of measure are concerned that do not have their own types.)

mshinwell · 2018-09-27T15:24:58Z

asmcomp/comballoc.ml

@@ -20,9 +20,9 @@ open Mach
 type allocation_state =
    No_alloc                            (* no allocation is pending *)
  | Pending_alloc of Reg.t * int        (* an allocation is pending *)


This should maybe use an inline record in order to give a proper name to the "int". In fact, maybe the names could be sufficiently good that the comment can then be removed.

mshinwell · 2018-09-27T15:28:38Z

asmcomp/comballoc.ml

+      | _ ->
+         let (newnext, newsz) = combine i.next (Pending_alloc(i.res.(0), sz)) in
+         let newnext =
+           if newsz = sz then newnext


I think this might be more immediately clear with something like this, having done the rename of newsz:

let required_offset = totalsz - sz in if required_offset <= 0 then newnext else instr_cons_debug ... required_offset ...

stedolan · 2018-11-14T11:14:40Z

Apologies that this took so long to get to. I've made the coding style changes @mshinwell suggests.

I had originally intended to try @gasche's suggested approach, but it would take much longer, and realistically I'm unlikely to find the time soon. (I've not been working on this code for some time, so it would take a while to remember the details).

mshinwell · 2018-11-14T11:39:25Z

This is ok to merge once CI passes.

stedolan force-pushed the comballoc-ordering branch from 7785823 to f56bba6 Compare July 19, 2018 16:25

damiendoligez added multicore-prerequisite high-priority labels Sep 11, 2018

mshinwell approved these changes Sep 27, 2018

View reviewed changes

stedolan added 2 commits November 14, 2018 10:36

Ensure that comballoc.ml does not reverse the order of allocations.

07a27d2

Coding style changes to comballoc.ml.

6b2a8e1

stedolan force-pushed the comballoc-ordering branch from f56bba6 to 6b2a8e1 Compare November 14, 2018 11:00

mshinwell merged commit 63add57 into ocaml:trunk Nov 14, 2018

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Ensure that comballoc.ml does not reverse the order of allocations #1917

Ensure that comballoc.ml does not reverse the order of allocations #1917

stedolan commented Jul 19, 2018 •

edited

gasche commented Jul 20, 2018

stedolan commented Jul 20, 2018 •

edited

gasche commented Jul 20, 2018

stedolan commented Sep 18, 2018

mshinwell commented Sep 26, 2018

gasche commented Sep 26, 2018

mshinwell commented Sep 26, 2018

mshinwell left a comment

mshinwell Sep 27, 2018

mshinwell Sep 27, 2018

mshinwell Sep 27, 2018

mshinwell Sep 27, 2018

mshinwell Sep 27, 2018

stedolan commented Nov 14, 2018

mshinwell commented Nov 14, 2018

Ensure that comballoc.ml does not reverse the order of allocations #1917

Ensure that comballoc.ml does not reverse the order of allocations #1917

Conversation

stedolan commented Jul 19, 2018 • edited

gasche commented Jul 20, 2018

stedolan commented Jul 20, 2018 • edited

gasche commented Jul 20, 2018

stedolan commented Sep 18, 2018

mshinwell commented Sep 26, 2018

gasche commented Sep 26, 2018

mshinwell commented Sep 26, 2018

mshinwell left a comment

Choose a reason for hiding this comment

mshinwell Sep 27, 2018

Choose a reason for hiding this comment

mshinwell Sep 27, 2018

Choose a reason for hiding this comment

mshinwell Sep 27, 2018

Choose a reason for hiding this comment

mshinwell Sep 27, 2018

Choose a reason for hiding this comment

mshinwell Sep 27, 2018

Choose a reason for hiding this comment

stedolan commented Nov 14, 2018

mshinwell commented Nov 14, 2018

stedolan commented Jul 19, 2018 •

edited

stedolan commented Jul 20, 2018 •

edited